NumPy Internals
Before reading any further, predict what this prints:
import numpy as np
a = np.arange(16, dtype=np.float32).reshape(4, 4)
b = a[::2, ::2] # Every other row and column
b[0, 0] = 999
print(a[0, 0]) # ?
print(b.flags['OWNDATA']) # ?
print(b.strides) # ?
Most engineers answer: 0.0, False, (8, 8). The correct answers are 999.0, False, (32, 16).
Understanding why requires knowing exactly what a NumPy array is at the memory level. The strides are not (8, 8) because stepping over every other element in a 4-column row means jumping 16 bytes (2 elements * 8 bytes/element for float32? No - float32 is 4 bytes, so 2 * 4 = 8 bytes for columns, but rows are 4 * 4 = 16 bytes apart). Actually: float32 is 4 bytes. One row is 4 elements * 4 bytes = 16 bytes. Stepping over 2 rows = 32 bytes. Stepping over 2 columns = 8 bytes. So b.strides == (32, 16) -- and b.itemsize is still 4.
This one mental model -- strides -- unlocks everything else: broadcasting, views vs copies, memory layout, and why certain operations are 10x faster than alternatives.
What You Will Learn
- The ndarray memory model: data buffer, dtype, shape, strides, flags
- C-order vs Fortran-order and their performance implications
- How strides enable views without copying (with zero-allocation slicing)
- The three rules of broadcasting, and how to debug shape errors
- Views vs copies: what creates each, and the subtle bugs that arise
- Fancy indexing and boolean indexing: when they copy vs when they do not
- Ufuncs: built-in and custom, and why
np.vectorizeis not vectorisation - Advanced
np.where,np.einsum, and memory-efficient operations on large datasets - Interview patterns for NumPy-heavy ML engineering roles
Prerequisites
- Python Foundation and Python Intermediate complete
- Basic familiarity with NumPy:
np.array,np.arange,np.zeros, slicing,shape - Understanding of what a CPU cache is (L1/L2/L3)
Part 1 - The ndarray Memory Model
A NumPy ndarray is not a Python object that holds numbers. It is a lightweight Python wrapper around a contiguous block of raw memory - a typed, flat byte buffer.
import numpy as np
# This creates 4 * 8 = 32 bytes of contiguous memory containing
# four 64-bit IEEE 754 floats.
a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float64)
# The buffer lives at a specific memory address
print(f"Buffer address: {a.ctypes.data:x}")
# The ndarray object itself is tiny - it's just metadata
print(f"Itemsize: {a.itemsize}") # 8 bytes per float64
print(f"Nbytes: {a.nbytes}") # 32 bytes total
print(f"Dtype: {a.dtype}") # float64
print(f"Shape: {a.shape}") # (4,)
print(f"Strides: {a.strides}") # (8,) - 8 bytes to next element
print(f"Owns data: {a.flags['OWNDATA']}") # True - this array owns its buffer
Five attributes fully describe any ndarray:
| Attribute | Type | Meaning |
|---|---|---|
data | ctypes pointer | Address of the first byte of the buffer |
dtype | numpy.dtype | Data type of each element |
shape | tuple of int | Size along each dimension |
strides | tuple of int | Bytes to step to reach next element in each dimension |
flags | dict | OWNDATA, C_CONTIGUOUS, F_CONTIGUOUS, WRITEABLE |
The strides are the key. They are what allow NumPy to represent slices, transpositions, and reshapes as views into the same buffer without copying data.
How Strides Represent Multidimensional Arrays
import numpy as np
# A 3x4 float64 matrix
a = np.arange(12, dtype=np.float64).reshape(3, 4)
print(a)
# [[ 0. 1. 2. 3.]
# [ 4. 5. 6. 7.]
# [ 8. 9. 10. 11.]]
print(f"Shape: {a.shape}") # (3, 4)
print(f"Strides: {a.strides}") # (32, 8)
# 32 bytes to step from one row to the next (4 elements * 8 bytes)
# 8 bytes to step from one column to the next (1 element * 8 bytes)
# Element at [row, col] is at: data_ptr + row*32 + col*8
# a[1, 2] = data_ptr + 1*32 + 2*8 = data_ptr + 48 => value 6.0 ✓
The buffer in memory is simply: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].
The 2D structure is an illusion imposed by shape and strides.
C-Order vs Fortran-Order
The default memory layout in NumPy is C-order (row-major): elements in the same row are contiguous. Fortran-order is column-major: elements in the same column are contiguous.
import numpy as np
data = [[1, 2, 3],
[4, 5, 6]]
c_arr = np.array(data, order='C') # Default
f_arr = np.array(data, order='F')
print(f"C-order strides: {c_arr.strides}") # (24, 8) - row stride 24, col stride 8
print(f"F-order strides: {f_arr.strides}") # (8, 16) - row stride 8, col stride 16
# Peek at the raw bytes:
import ctypes
def show_buffer(arr):
ptr = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
return [ptr[i] for i in range(arr.size)]
print(f"C buffer: {show_buffer(c_arr)}") # [1, 2, 3, 4, 5, 6] - rows first
print(f"F buffer: {show_buffer(f_arr)}") # [1, 4, 2, 5, 3, 6] - cols first
Why does this matter? CPU cache lines are typically 64 bytes. When you access sequential memory addresses, the CPU pre-fetches adjacent bytes into cache. When you access non-sequential memory (large strides), every access is a cache miss.
import numpy as np
import time
n = 4096
a_c = np.random.randn(n, n) # C-order
a_f = np.asfortranarray(a_c.copy()) # F-order (same data, different layout)
# Row-sum: accesses memory in row order - C-order is cache-friendly
t0 = time.perf_counter()
for _ in range(20):
_ = a_c.sum(axis=1) # Sum along columns for each row
t_c_row = time.perf_counter() - t0
t0 = time.perf_counter()
for _ in range(20):
_ = a_f.sum(axis=1) # Same logical operation, but F-order strides are bad for row access
t_f_row = time.perf_counter() - t0
print(f"Row sum - C-order: {t_c_row:.3f}s, F-order: {t_f_row:.3f}s")
# Row sum - C-order: 0.18s, F-order: 0.52s (~3x slower)
# Column-sum: F-order wins
t0 = time.perf_counter()
for _ in range(20):
_ = a_c.sum(axis=0)
t_c_col = time.perf_counter() - t0
t0 = time.perf_counter()
for _ in range(20):
_ = a_f.sum(axis=0)
t_f_col = time.perf_counter() - t0
print(f"Col sum - C-order: {t_c_col:.3f}s, F-order: {t_f_col:.3f}s")
# Col sum - C-order: 0.51s, F-order: 0.19s (~3x slower)
:::tip Memory Layout Rule
- Default to C-order (the NumPy default). Most ML operations are row-wise (processing one sample at a time).
- Use F-order only when you have a heavy column-access workload, or when interfacing with Fortran libraries (LAPACK).
- PyTorch tensors default to C-order (called "contiguous" in PyTorch). A Fortran-order NumPy array converted to PyTorch will trigger an implicit copy. :::
Part 2 - Views vs Copies
This is the most practically important concept in NumPy for ML engineers. Getting it wrong means either bugs (modifying data you did not intend to modify) or performance problems (copying gigabytes of data unnecessarily).
What Creates a View
A view is an ndarray that shares the underlying buffer with another array. No memory allocation occurs.
import numpy as np
a = np.arange(24, dtype=np.float64).reshape(4, 6)
# --- VIEWS (no allocation) ---
v1 = a[1:3, :] # Slicing with basic indexing → view
v2 = a.reshape(6, 4) # Reshape → view (when contiguous)
v3 = a.T # Transpose → view (reverses strides)
v4 = a.ravel() # Flatten → view (when contiguous)
v5 = a[:, ::2] # Stride-2 column slice → view
# Confirm view status
for name, arr in [("slice", v1), ("reshape", v2), ("T", v3),
("ravel", v4), ("col stride", v5)]:
print(f"{name:12}: base is a = {arr.base is a}")
# All print True - they all share a's buffer
# Modification propagates back
v1[0, 0] = -99
print(a[1, 0]) # -99 - the original was modified!
What Creates a Copy
import numpy as np
a = np.arange(24, dtype=np.float64).reshape(4, 6)
# --- COPIES (new memory allocation) ---
c1 = a.copy() # Explicit copy
c2 = a[[0, 2], :] # Fancy indexing (non-contiguous index list)
c3 = a[a > 10] # Boolean indexing
c4 = np.concatenate([a, a]) # Concatenation always allocates
c5 = a.flatten() # flatten() always copies; ravel() does not
c6 = a.astype(np.float32) # Type conversion always copies
for name, arr in [("copy()", c1), ("fancy idx", c2), ("bool idx", c3),
("concat", c4), ("flatten", c5), ("astype", c6)]:
# base is None means the array owns its own data (not a view)
print(f"{name:12}: owns data = {arr.base is None}")
The Large-Array Trap
import numpy as np
import sys
# Allocate a large array (simulating a model's weight matrix or a data batch)
big = np.random.randn(10_000_000) # ~80 MB
# Take a tiny slice - but this is a VIEW, not a copy
tiny_view = big[:5]
# Even though we only need 5 elements, the entire 80 MB stays alive
del big
# big's buffer is NOT freed because tiny_view.base still references it
print(f"tiny_view.base is not None: {tiny_view.base is not None}")
# True - 80 MB still in memory
# Fix: copy the slice
big2 = np.random.randn(10_000_000)
tiny_copy = big2[:5].copy()
del big2
# Now the 80 MB is freed; tiny_copy has its own 40-byte buffer
print(f"tiny_copy.base is None: {tiny_copy.base is None}") # True
:::danger Views Keep the Entire Base Array Alive
When you slice a large array and store only the slice, Python cannot garbage collect the original buffer because the view holds a reference to it via .base. In long-running ML pipelines - where you process large batches and keep only aggregates - this causes memory to grow without bound. Always call .copy() on small slices when you are done with the large array.
:::
Part 3 - Strides in Depth: Sliding Windows and Tricks
Understanding strides directly enables a powerful pattern: creating overlapping views for rolling windows, patch extraction, and convolution-like operations.
import numpy as np
def sliding_window_view(arr: np.ndarray, window: int) -> np.ndarray:
"""
Return a (n - window + 1, window) view of arr with zero allocation.
Used for rolling statistics, sequence modelling, convolution.
"""
# arr has shape (n,) with strides (itemsize,)
# We want shape (n-window+1, window) with strides (itemsize, itemsize)
# i.e., each row i is arr[i:i+window], sharing the same buffer.
n = arr.shape[0]
if n < window:
raise ValueError(f"Array length {n} < window {window}")
shape = (n - window + 1, window)
# Step along the output rows by one element; step within a row by one element.
strides = (arr.strides[0], arr.strides[0])
return np.lib.stride_tricks.as_strided(arr, shape=shape, strides=strides)
# Demo: compute 5-day rolling mean via stride tricks
prices = np.array([100, 101, 103, 102, 105, 107, 106, 108], dtype=np.float64)
windows = sliding_window_view(prices, window=5)
print("Windows (no allocation):")
print(windows)
# [[100, 101, 103, 102, 105],
# [101, 103, 102, 105, 107],
# [103, 102, 105, 107, 106],
# [102, 105, 107, 106, 108]]
rolling_mean = windows.mean(axis=1)
print(f"Rolling mean: {rolling_mean}") # [102.2, 103.6, 104.6, 105.6]
Note that np.lib.stride_tricks.sliding_window_view was added in NumPy 1.20 and provides the same functionality with bounds checking. For production code, prefer the built-in. The manual version above illustrates what the built-in does under the hood.
import numpy as np
# Built-in (NumPy >= 1.20, recommended for production)
prices = np.arange(1000, dtype=np.float64)
windows = np.lib.stride_tricks.sliding_window_view(prices, window_shape=30)
# Shape: (971, 30) - 971 windows of 30 days each
# Zero allocation - it's a view
rolling_std = windows.std(axis=1) # Shape (971,) - rolling volatility
# Benchmark: stride view vs loop
import time
n = 100_000
data = np.random.randn(n)
w = 50
t0 = time.perf_counter()
windows = np.lib.stride_tricks.sliding_window_view(data, w)
result_stride = windows.mean(axis=1)
t_stride = time.perf_counter() - t0
t0 = time.perf_counter()
result_loop = np.array([data[i:i+w].mean() for i in range(n - w + 1)])
t_loop = time.perf_counter() - t0
print(f"Stride view: {t_stride*1000:.1f} ms")
print(f"Loop: {t_loop*1000:.1f} ms")
print(f"Speedup: {t_loop / t_stride:.0f}x")
# Stride view: 1.2 ms
# Loop: 45.0 ms
# Speedup: 37x
Image Patch Extraction
The same trick works in 2D for convolution or patch-based ML:
import numpy as np
def extract_patches(image: np.ndarray, patch_size: int, stride: int = 1) -> np.ndarray:
"""
Extract all patches from a 2D image as a view (no allocation).
Returns shape (n_patches_h, n_patches_w, patch_size, patch_size).
"""
h, w = image.shape
ps = patch_size
out_h = (h - ps) // stride + 1
out_w = (w - ps) // stride + 1
sh, sw = image.strides
out_shape = (out_h, out_w, ps, ps)
out_strides = (sh * stride, sw * stride, sh, sw)
return np.lib.stride_tricks.as_strided(image, shape=out_shape, strides=out_strides)
# 64x64 grayscale image, extract 8x8 patches
image = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
patches = extract_patches(image, patch_size=8, stride=4)
print(f"Patch grid shape: {patches.shape}") # (15, 15, 8, 8)
print(f"Total patches: {patches.shape[0] * patches.shape[1]}") # 225
print(f"Memory for patches view: {patches.nbytes / 1024:.0f} KB") # technically ~3.5 KB (it's a view)
:::danger as_strided Is Unsafe
np.lib.stride_tricks.as_strided bypasses all bounds checking. It is possible to construct strides that read memory outside the array's buffer - with undefined, non-repeatable results. Always prefer np.lib.stride_tricks.sliding_window_view for the 1D case. For custom 2D strides, add explicit bounds checks or use Cython/numba.
:::
Part 4 - Broadcasting Rules
Broadcasting is NumPy's mechanism for applying operations between arrays of different shapes without explicit loops or copies. It is syntactic sugar for a specific class of outer-product-like operations.
The Three Rules
- Pad shapes on the left with 1s until both arrays have the same number of dimensions.
- Dimensions of size 1 are stretched to match the other array's size in that dimension.
- If dimensions are neither equal nor 1, raise
ValueError.
import numpy as np
# Example 1: scalar broadcast
a = np.array([1.0, 2.0, 3.0]) # shape (3,)
b = 10.0 # shape () → padded to (1,) → stretched to (3,)
print(a + b) # [11. 12. 13.]
# Example 2: 1D vs 2D
row = np.array([1, 2, 3]) # shape (3,) → padded to (1, 3)
col = np.array([[10], [20]]) # shape (2, 1)
# After padding: (1, 3) and (2, 1)
# After stretching: (2, 3) and (2, 3)
print(row + col)
# [[11 12 13]
# [21 22 23]]
print((row + col).shape) # (2, 3)
# Example 3: incompatible shapes
a = np.ones((3, 4))
b = np.ones((2, 4))
try:
_ = a + b
except ValueError as e:
print(e)
# operands could not be broadcast together with shapes (3,4) (2,4)
Step-by-Step Broadcasting Visualisation
Normalise 1000 samples of 5 features to zero mean, unit variance:
data shape: (1000, 5)
means shape: (5,) → padded to (1, 5) → stretched to (1000, 5)
stds shape: (5,) → padded to (1, 5) → stretched to (1000, 5)
(data - means): element-wise on aligned (1000, 5) shapes ✓
/ stds: element-wise on aligned (1000, 5) shapes ✓
import numpy as np
data = np.random.randn(1000, 5)
means = data.mean(axis=0) # (5,)
stds = data.std(axis=0) # (5,)
# Single vectorised expression - no loops, no intermediate (1000, 5) arrays
# for means or stds (they broadcast on the fly)
normalised = (data - means) / stds
print(f"Post-normalisation means: {normalised.mean(axis=0).round(6)}") # ~0
print(f"Post-normalisation stds: {normalised.std(axis=0).round(6)}") # ~1
Practical Broadcasting: Pairwise Distance
One of the most common ML operations - computing pairwise distances between sets of points - is elegantly expressed through broadcasting:
import numpy as np
def pairwise_euclidean(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
"""
Compute pairwise Euclidean distances between rows of X and rows of Y.
X: (m, d), Y: (n, d) → returns (m, n)
No loops.
"""
# Expand dims to enable broadcasting:
# X[:, np.newaxis, :] → (m, 1, d)
# Y[np.newaxis, :, :] → (1, n, d)
# diff → (m, n, d)
diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :] # (m, n, d)
return np.sqrt((diff ** 2).sum(axis=2)) # (m, n)
X = np.random.randn(100, 10) # 100 query points in 10D
Y = np.random.randn(200, 10) # 200 reference points
dist = pairwise_euclidean(X, Y)
print(f"Distance matrix: {dist.shape}") # (100, 200)
# Nearest neighbour for each query point
nn_idx = dist.argmin(axis=1) # (100,) - index into Y rows
print(f"Nearest neighbour indices: {nn_idx[:5]}")
:::tip Debug Broadcasting with np.broadcast_shapes
When a shape error occurs, np.broadcast_shapes tells you what the result shape would be (or raises the same error in isolation), making it easier to find the mismatch:
np.broadcast_shapes((5, 1, 3), (4, 1)) # → (5, 4, 3)
np.broadcast_shapes((3,), (2,)) # → ValueError
:::
Part 5 - Indexing: Basic, Fancy, and Boolean
NumPy supports three indexing modes. The distinction between them determines whether you get a view or a copy.
Basic Indexing (View)
Basic indexing uses integers and slices. It always returns a view.
import numpy as np
a = np.arange(100).reshape(10, 10)
v1 = a[2, :] # Row 2 - view
v2 = a[:, 3] # Column 3 - view (1D)
v3 = a[1:4, 2:6] # Submatrix - view
v4 = a[::2, ::3] # Strided - view
v5 = a[..., -1] # Last column - view (ellipsis)
# All are views: modification propagates
v3[0, 0] = -1
print(a[1, 2]) # -1
Fancy Indexing (Copy)
Fancy indexing uses integer arrays or lists. It always returns a copy.
import numpy as np
a = np.arange(100, dtype=np.float64)
# Integer array index → copy
idx = np.array([0, 5, 10, 50, 99])
c1 = a[idx] # Copy
# 2D fancy index
a2d = np.arange(25).reshape(5, 5)
rows = np.array([0, 2, 4])
cols = np.array([1, 3, 0])
c2 = a2d[rows, cols] # a2d[0,1], a2d[2,3], a2d[4,0] - 1D copy
# Fancy indexing for sorting
sort_order = np.argsort(a) # Indices that would sort a
sorted_a = a[sort_order] # Copy in sorted order
Boolean Indexing (Copy)
Boolean indexing (masking) returns a copy of the selected elements.
import numpy as np
a = np.random.randn(1_000_000)
# Boolean mask - creates a copy of selected elements
mask = a > 2.0
outliers = a[mask] # Copy of elements > 2.0
print(f"Outliers: {len(outliers):,} ({len(outliers)/len(a):.2%})")
# In-place masking (modify without creating a new array)
a[mask] = 2.0 # Clip outliers in place - modifies original buffer
# Note: a[mask] = ... is safe to do in-place even though a[mask] would create a copy
# NumPy detects the assignment pattern and writes back directly
# Combining masks (vectorised logical operations)
data = np.random.randn(1_000_000)
valid = (data > -3) & (data < 3) & np.isfinite(data)
clean = data[valid]
# np.where: vectorised if-else
# Replace outliers with the mean (no Python loop)
mean = data[valid].mean()
result = np.where(valid, data, mean)
Advanced: np.ix_ for Outer Index Grids
import numpy as np
a = np.arange(25).reshape(5, 5)
# Select rows 0,2,4 and cols 1,3 - a 3x2 submatrix
# Wrong: a[[0,2,4], [1,3]] → selects (0,1), (2,3) - 1D result of 2 elements
# Right: use np.ix_ to create an outer index
rows = [0, 2, 4]
cols = [1, 3]
sub = a[np.ix_(rows, cols)]
print(sub.shape) # (3, 2) - correctly the 3x2 submatrix
print(sub)
# [[ 1 3]
# [11 13]
# [21 23]]
Part 6 - Ufuncs and Vectorised Operations
Universal functions (ufuncs) are NumPy's compiled C functions for element-wise operations. Every mathematical operation you write on NumPy arrays dispatches to a ufunc.
import numpy as np
a = np.array([1.0, 4.0, 9.0, 16.0])
# These are all ufuncs (same function, different syntax)
print(np.sqrt(a)) # [1. 2. 3. 4.]
print(np.add(a, 10)) # [11. 14. 19. 26.] same as a + 10
print(np.multiply(a, 2)) # [2. 8. 18. 32.] same as a * 2
print(np.maximum(a, 5)) # [5. 5. 9. 16.]
# Ufuncs support reduce, accumulate, outer
print(np.add.reduce(a)) # 30.0 (sum)
print(np.add.accumulate(a)) # [1. 5. 14. 30.] (cumulative sum)
print(np.multiply.outer([1,2,3], [10,20]))
# [[10 20]
# [20 40]
# [30 60]]
Why np.vectorize Is Not Vectorisation
import numpy as np
import timeit
# np.vectorize wraps a Python function in a loop
# It provides broadcasting semantics but NOT C-speed
@np.vectorize
def slow_clip(x, lo=-1.0, hi=1.0):
return max(lo, min(hi, x))
# True vectorised version using ufuncs
def fast_clip(x, lo=-1.0, hi=1.0):
return np.clip(x, lo, hi)
data = np.random.randn(1_000_000)
t_slow = timeit.timeit(lambda: slow_clip(data), number=5) / 5
t_fast = timeit.timeit(lambda: fast_clip(data), number=5) / 5
print(f"np.vectorize: {t_slow*1000:.1f} ms")
print(f"np.clip: {t_fast*1000:.1f} ms")
print(f"Speedup: {t_slow / t_fast:.0f}x")
# np.vectorize: 820.0 ms
# np.clip: 2.1 ms
# Speedup: 390x
np.vectorize is useful for prototyping and handling arbitrary Python objects, but it is a Python loop under the hood. For numeric code, always prefer the built-in ufuncs.
Writing Truly Vectorised Custom Functions
import numpy as np
# Leaky ReLU: the wrong way
@np.vectorize
def leaky_relu_slow(x, alpha=0.01):
return x if x > 0 else alpha * x
# The right way: compose ufuncs
def leaky_relu(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
return np.where(x > 0, x, alpha * x)
# Even better for this specific case: avoid computing alpha*x where x>0
def leaky_relu_optimised(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
out = x.copy()
mask = x < 0
out[mask] *= alpha
return out
# GELU activation (used in transformers): compose numpy ufuncs
def gelu(x: np.ndarray) -> np.ndarray:
# GELU(x) = x * Phi(x) where Phi is the standard normal CDF
# Approximation: x * 0.5 * (1 + tanh(sqrt(2/pi) * (x + 0.044715*x^3)))
coef = np.sqrt(2.0 / np.pi)
return x * 0.5 * (1.0 + np.tanh(coef * (x + 0.044715 * x ** 3)))
data = np.random.randn(10_000_000)
result = gelu(data)
print(f"GELU output range: [{result.min():.2f}, {result.max():.2f}]")
Part 7 - Memory-Efficient Operations for Large Datasets
In-Place Operations
import numpy as np
n = 50_000_000
a = np.random.randn(n) # ~400 MB
# This creates a temporary:
# b = a * 2 + 1 # allocates 400 MB for (a*2), then 400 MB for the result
# In-place avoids temporaries:
a *= 2 # in-place - no new allocation
a += 1 # in-place - no new allocation
# For compound expressions, use out= parameter
result = np.empty_like(a)
np.multiply(a, 3, out=result) # writes directly to result
np.add(result, 5, out=result) # writes directly to result
Chunked Processing
import numpy as np
def process_large_array_chunked(
data: np.ndarray,
chunk_size: int = 1_000_000,
) -> np.ndarray:
"""
Apply expensive transforms to a large array in chunks.
Avoids allocating the full transformed array twice.
"""
result = np.empty_like(data)
n = len(data)
for start in range(0, n, chunk_size):
end = min(start + chunk_size, n)
chunk = data[start:end] # View - no allocation
# Expensive transform
result[start:end] = np.log1p(np.abs(chunk)) * np.sign(chunk)
return result
# Test: 100M element array
data = np.random.randn(100_000_000)
result = process_large_array_chunked(data, chunk_size=5_000_000)
print(f"Processed {len(data):,} elements")
print(f"Result stats: mean={result.mean():.4f}, std={result.std():.4f}")
Memory-Mapped Arrays
import numpy as np
# Create a 2 GB memory-mapped array (stored on disk, not loaded into RAM)
shape = (250_000_000,) # 250M float64 = 2 GB
# Write data to disk
fp = np.memmap('/tmp/large_array.npy', dtype=np.float64, mode='w+', shape=shape)
fp[:] = np.random.randn(*shape)
fp.flush()
del fp
# Read back without loading into RAM
arr = np.memmap('/tmp/large_array.npy', dtype=np.float64, mode='r', shape=shape)
# Operations on slices only load the accessed pages into RAM
sample = arr[:1_000_000] # Only the first 8 MB loaded
print(f"Mean of first 1M: {sample.mean():.4f}")
# Full sum computed in chunks by NumPy (internal paging)
total = arr.sum()
print(f"Total sum: {total:.2f}")
Part 8 - Production Patterns and Common Pitfalls
Pattern: Vectorised Feature Engineering
import numpy as np
def engineer_features(
prices: np.ndarray, # (n,) daily close prices
volumes: np.ndarray, # (n,) daily volumes
returns_window: int = 20,
) -> np.ndarray:
"""
Compute technical features for a price series.
Returns (n, 7) feature matrix, no Python loops.
"""
n = len(prices)
assert len(volumes) == n
# 1. Log returns
log_returns = np.zeros(n)
log_returns[1:] = np.diff(np.log(prices))
# 2. Rolling mean (cumsum trick) - O(n) for any window
cs = np.concatenate([[0], np.cumsum(prices)])
rolling_mean = np.zeros(n)
w = returns_window
rolling_mean[w:] = (cs[w:] - cs[:-w]) / w
# 3. Rolling std (computational formula for variance)
cs2 = np.concatenate([[0], np.cumsum(prices ** 2)])
rolling_mean_sq = np.zeros(n)
rolling_mean_sq[w:] = (cs2[w:] - cs2[:-w]) / w
variance = np.maximum(rolling_mean_sq - rolling_mean ** 2, 0)
rolling_std = np.sqrt(variance)
# 4. Z-score (avoid div-by-zero)
z_score = np.where(rolling_std > 0, (prices - rolling_mean) / rolling_std, 0.0)
# 5. Price-volume trend (momentum signal)
price_change = np.zeros(n)
price_change[1:] = np.diff(prices)
obv = np.cumsum(np.sign(price_change) * volumes)
# 6. Log volume
log_volume = np.log1p(volumes)
# 7. Relative price to rolling mean
rel_price = np.where(rolling_mean > 0, prices / rolling_mean, 1.0)
return np.column_stack([
log_returns,
rolling_mean,
rolling_std,
z_score,
obv / obv.std() if obv.std() > 0 else obv,
log_volume,
rel_price,
])
# Test with simulated data
np.random.seed(42)
n = 10_000
prices = np.cumprod(1 + np.random.randn(n) * 0.01) * 100
volumes = np.random.exponential(1_000_000, n)
import time
t0 = time.perf_counter()
features = engineer_features(prices, volumes)
elapsed = time.perf_counter() - t0
print(f"Shape: {features.shape}") # (10000, 7)
print(f"Time: {elapsed*1000:.1f} ms")
# ~2 ms for 10K samples - pure vectorised NumPy
Common Pitfalls
import numpy as np
# PITFALL 1: Accidental copy inside a loop
# BAD: appending to a Python list, then converting
data_bad = []
for i in range(10_000):
data_bad.append(np.random.randn(100))
result_bad = np.array(data_bad) # One big allocation at the end - OK
# But: allocating 10,000 Python list entries + 10,000 small ndarrays is slow
# GOOD: pre-allocate and fill
result_good = np.empty((10_000, 100), dtype=np.float64)
for i in range(10_000):
result_good[i] = np.random.randn(100) # Write into pre-allocated row
# EVEN BETTER: batch the random generation
result_best = np.random.randn(10_000, 100)
# PITFALL 2: Float accumulation error in reduction
# Bad: accumulate in float32
a = np.ones(1_000_000, dtype=np.float32)
print(f"float32 sum: {a.sum()}") # Correct-ish: 1000000.0
a[0] += 1e-7
print(f"float32 sum after +1e-7: {a.sum()}") # May lose the small increment
# Good: use float64 for accumulations
print(f"float64 sum: {a.astype(np.float64).sum():.1f}")
# PITFALL 3: Implicit dtype promotion surprises
a = np.array([1, 2, 3], dtype=np.int8)
b = np.array([100, 200, 300], dtype=np.int16)
result = a + b
print(f"Result dtype: {result.dtype}") # int16 - promoted to widest
# In ML, mixing int8 activations with float32 weights gives float32 output
# Know your dtype conventions before mixing
# PITFALL 4: NaN contamination
data = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
print(f"Mean with NaN: {data.mean()}") # nan - contaminated!
print(f"Nanmean: {np.nanmean(data)}") # 3.0 - correct
# Always validate inputs before training
has_nan = np.any(np.isnan(data))
has_inf = np.any(np.isinf(data))
print(f"Data has NaN: {has_nan}, Inf: {has_inf}")
Interview Patterns
These patterns appear in ML engineering interviews specifically about NumPy performance and correctness.
Pattern 1: Rolling operations without loops. Use the cumulative sum trick for O(n) rolling mean/variance. Know that np.lib.stride_tricks.sliding_window_view produces views.
Pattern 2: Argmax per group. Given parallel arrays of group IDs and values, find the index of the maximum value in each group without a loop:
import numpy as np
groups = np.array([0, 0, 1, 1, 1, 2, 2])
values = np.array([3, 1, 7, 2, 9, 4, 8], dtype=np.float64)
# Sort by group, then by value descending within group
sort_idx = np.lexsort((-values, groups)) # secondary key first
sorted_groups = groups[sort_idx]
sorted_values = values[sort_idx]
# First occurrence of each group after sorting is its argmax
_, first_occ = np.unique(sorted_groups, return_index=True)
argmax_per_group = sort_idx[first_occ]
print("Max indices per group:", argmax_per_group) # [0, 4, 6]
print("Max values per group:", values[argmax_per_group]) # [3, 9, 8]
Pattern 3: Boolean mask to count, sum, and index. Always prefer mask.sum() over len(arr[mask]) - the latter creates a copy first. Use np.where(mask) to get indices without materialising the selected values.
Pattern 4: Memory estimation before allocation. Before allocating a large array, check: n_rows * n_cols * itemsize / 1024**3 GB. Know common itemsizes: float64=8, float32=4, float16=2, int32=4, int8=1.
Key Takeaways
- An ndarray is a metadata wrapper (shape, strides, dtype) around a flat byte buffer. Shape and strides are all that differ between a row vector, a column vector, and a transposed matrix - the buffer is the same.
- C-order stores rows contiguously; Fortran-order stores columns contiguously. Match layout to your access pattern for 2-3x cache performance gains.
- Slicing with basic indexing (integers and slices) always returns a view. Fancy indexing (integer arrays) and boolean indexing always return copies. Knowing the difference prevents both bugs and unnecessary allocations.
- Views keep the entire base array alive. Copy small slices from large arrays when you are done with the original.
- Broadcasting pads shapes from the left with 1s and stretches size-1 dimensions. It is not magic - it is pointer arithmetic over the same buffer.
np.vectorizeis a loop wrapper, not vectorisation. Usenp.where,np.clip,np.searchsorted, and the built-in ufuncs for real C-speed element-wise operations.- Use the cumsum trick for O(n) rolling statistics. Use
sliding_window_viewfor O(1)-per-window sliding window computation. - Pre-allocate result arrays and write into them rather than appending. Pre-allocating once is always faster than repeated growing allocations.
Practice Problems
Level 1 - Predict the Output
Problem 1: What are the shape and strides of b?
import numpy as np
a = np.arange(30, dtype=np.float64).reshape(5, 6)
b = a[::2, ::3]
print(b.shape, b.strides)
Answer
b.shape = (3, 2) and b.strides = (96, 24).
Row stride: stepping 2 rows in a, each row is 6 * 8 = 48 bytes, so 2 * 48 = 96 bytes.
Column stride: stepping 3 columns in a, each column is 8 bytes, so 3 * 8 = 24 bytes.
Shape: floor((5-1)/2)+1 = 3 rows, floor((6-1)/3)+1 = 2 cols.
b is a view.
Problem 2: Does modifying c affect a?
import numpy as np
a = np.arange(10)
c = a[[0, 2, 4, 6, 8]]
c[0] = -999
print(a[0])
Answer
a[0] prints 0. Fancy indexing (a[[list]]) always creates a copy. c is independent of a.
Problem 3: What does this broadcast produce and why?
import numpy as np
a = np.ones((4, 1, 3))
b = np.ones((1, 5, 1))
print((a + b).shape)
Answer
(4, 5, 3). Align shapes: (4, 1, 3) and (1, 5, 1). Dimension by dimension: max(4,1)=4, max(1,5)=5, max(3,1)=3.
Level 2 - Debug Challenge
This rolling standard deviation function gives wrong results for the first window elements. Fix it without adding a Python loop.
import numpy as np
def rolling_std(data: np.ndarray, window: int) -> np.ndarray:
cs = np.cumsum(data)
cs2 = np.cumsum(data ** 2)
n = np.arange(1, len(data) + 1)
mean = cs / n
mean2 = cs2 / n
return np.sqrt(np.maximum(mean2 - mean ** 2, 0))
# Expected: nan or 0 for indices < window, correct std for indices >= window
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
print(rolling_std(data, 3))
Answer
The current implementation computes an expanding window std (using all elements from 0 to i), not a rolling window std. Fix using the cumsum trick with a prepended zero:
import numpy as np
def rolling_std_fixed(data: np.ndarray, window: int) -> np.ndarray:
result = np.full(len(data), np.nan)
cs = np.concatenate([[0.0], np.cumsum(data)])
cs2 = np.concatenate([[0.0], np.cumsum(data ** 2)])
# For index i >= window-1, the window is data[i-window+1 : i+1]
w = window
rolling_mean = (cs[w:] - cs[:-w]) / w # shape: (n - w + 1,)
rolling_m2 = (cs2[w:] - cs2[:-w]) / w
variance = np.maximum(rolling_m2 - rolling_mean ** 2, 0)
result[w - 1:] = np.sqrt(variance)
return result
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
print(rolling_std_fixed(data, 3))
# [nan nan 1. 1. 1. 1. 1. 1. 1. 1.]
Level 3 - Design Challenge
You receive a stream of embeddings from a sentence encoder. Each embedding is a 768-dimensional float32 vector. You need to maintain a fixed-size circular buffer of the last 10,000 embeddings, and for each new embedding, retrieve the top-5 most similar embeddings from the buffer (cosine similarity). The system must handle 500 new embeddings per second with at most 50 ms latency per batch of 500.
Design the data structures and vectorised NumPy operations. Do not use a vector database library.
Solution Sketch
import numpy as np
class EmbeddingBuffer:
def __init__(self, capacity: int = 10_000, dim: int = 768):
self.capacity = capacity
self.dim = dim
# Pre-allocate the circular buffer - single allocation
self.buffer = np.zeros((capacity, dim), dtype=np.float32)
self.norms = np.zeros(capacity, dtype=np.float32) # pre-computed norms
self.head = 0 # next write position
self.filled = 0 # how many slots are valid
def add(self, embeddings: np.ndarray) -> None:
"""Add a batch of embeddings. embeddings: (k, 768)."""
k = len(embeddings)
# L2-normalise before storing (once, not per query)
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalised = embeddings / np.maximum(norms, 1e-9)
# Circular write - handle wrap-around
end = self.head + k
if end <= self.capacity:
self.buffer[self.head:end] = normalised
self.norms[self.head:end] = 1.0 # all pre-normalised
else:
first = self.capacity - self.head
self.buffer[self.head:] = normalised[:first]
self.buffer[:k - first] = normalised[first:]
self.head = end % self.capacity
self.filled = min(self.filled + k, self.capacity)
def top_k(self, query: np.ndarray, k: int = 5) -> tuple:
"""
Find top-k most similar embeddings to query.
query: (768,) - single query vector.
Returns (indices, scores), each shape (k,).
"""
q_norm = query / np.maximum(np.linalg.norm(query), 1e-9)
valid = self.buffer[:self.filled] # (filled, 768) - view
# Cosine similarity = dot product (since all vectors are normalised)
# q_norm: (768,), valid: (filled, 768) → scores: (filled,)
scores = valid @ q_norm # single BLAS GEMV call
# Partial sort: O(n + k log k) instead of O(n log n)
if self.filled <= k:
idx = np.argsort(-scores)
else:
idx = np.argpartition(-scores, k)[:k]
idx = idx[np.argsort(-scores[idx])] # sort just the top k
return idx, scores[idx]
# Test
buf = EmbeddingBuffer(capacity=10_000, dim=768)
# Simulate 500 embeddings arriving at once
batch = np.random.randn(500, 768).astype(np.float32)
buf.add(batch)
query = np.random.randn(768).astype(np.float32)
import time
t0 = time.perf_counter()
for _ in range(500):
idx, scores = buf.top_k(query, k=5)
elapsed = time.perf_counter() - t0
print(f"500 queries in {elapsed*1000:.1f} ms") # Well under 50 ms
Key design decisions: (1) pre-normalise on insert so queries only need a dot product; (2) use np.argpartition for O(n) partial sort instead of full sort; (3) pre-allocate the buffer rather than appending; (4) keep pre-computed norms to avoid redundant computation.
